Loan Data Exploration by Wonjun Lee

##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."              
## [16] "ProsperScore"                       
## [17] "ListingCategory..numeric."          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage." 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"
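The dotted names in the list above, such as `ProsperRating..numeric.`, are what R produces when it converts column headers containing spaces and parentheses into syntactic names. The loading code is not shown in this report, so the following is only a sketch. It assumes the data was read with `read.csv()` using its default `check.names = TRUE`, and it uses a two-column inline CSV (hypothetical header text and values) in place of the real file.

```r
# Inline stand-in for the real CSV; header text and values are assumed for illustration.
csv_text <- 'Term,"ProsperRating (numeric)"
36,6
60,3'

# read.csv() defaults to check.names = TRUE, which passes headers through make.names().
loans_demo <- read.csv(text = csv_text)
print(names(loans_demo))

# Passing check.names = FALSE keeps the headers exactly as written in the file.
loans_raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(loans_raw))
```

Keeping the default is convenient for `$` access, at the cost of names that no longer match the file's headers.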

Univariate Plots Section

## [1] 113937     81
##                    ListingKey     ListingNumber    
##  17A93590655669644DB4C06:     6   Min.   :      4  
##  349D3587495831350F0F648:     4   1st Qu.: 400919  
##  47C1359638497431975670B:     4   Median : 600554  
##  8474358854651984137201C:     4   Mean   : 627886  
##  DE8535960513435199406CE:     4   3rd Qu.: 892634  
##  04C13599434217079754AEE:     3   Max.   :1255725  
##  (Other)                :113912                    
##                     ListingCreationDate  CreditGrade         Term      
##  2013-10-02 17:20:16.550000000:     6          :84984   Min.   :12.00  
##  2013-08-28 20:31:41.107000000:     4   C      : 5649   1st Qu.:36.00  
##  2013-09-08 09:27:44.853000000:     4   D      : 5153   Median :36.00  
##  2013-12-06 05:43:13.830000000:     4   B      : 4389   Mean   :40.83  
##  2013-12-06 11:44:58.283000000:     4   AA     : 3509   3rd Qu.:36.00  
##  2013-08-21 07:25:22.360000000:     3   HR     : 3508   Max.   :60.00  
##  (Other)                      :113912   (Other): 6745                  
##                  LoanStatus                  ClosedDate   
##  Current              :56576                      :58848  
##  Completed            :38074   2014-03-04 00:00:00:  105  
##  Chargedoff           :11992   2014-02-19 00:00:00:  100  
##  Defaulted            : 5018   2014-02-11 00:00:00:   92  
##  Past Due (1-15 days) :  806   2012-10-30 00:00:00:   81  
##  Past Due (31-60 days):  363   2013-02-26 00:00:00:   78  
##  (Other)              : 1108   (Other)            :54633  
##   BorrowerAPR       BorrowerRate     LenderYield     
##  Min.   :0.00653   Min.   :0.0000   Min.   :-0.0100  
##  1st Qu.:0.15629   1st Qu.:0.1340   1st Qu.: 0.1242  
##  Median :0.20976   Median :0.1840   Median : 0.1730  
##  Mean   :0.21883   Mean   :0.1928   Mean   : 0.1827  
##  3rd Qu.:0.28381   3rd Qu.:0.2500   3rd Qu.: 0.2400  
##  Max.   :0.51229   Max.   :0.4975   Max.   : 0.4925  
##  NA's   :25                                          
##  EstimatedEffectiveYield EstimatedLoss   EstimatedReturn 
##  Min.   :-0.183          Min.   :0.005   Min.   :-0.183  
##  1st Qu.: 0.116          1st Qu.:0.042   1st Qu.: 0.074  
##  Median : 0.162          Median :0.072   Median : 0.092  
##  Mean   : 0.169          Mean   :0.080   Mean   : 0.096  
##  3rd Qu.: 0.224          3rd Qu.:0.112   3rd Qu.: 0.117  
##  Max.   : 0.320          Max.   :0.366   Max.   : 0.284  
##  NA's   :29084           NA's   :29084   NA's   :29084   
##  ProsperRating..numeric. ProsperRating..Alpha.  ProsperScore  
##  Min.   :1.000                  :29084         Min.   : 1.00  
##  1st Qu.:3.000           C      :18345         1st Qu.: 4.00  
##  Median :4.000           B      :15581         Median : 6.00  
##  Mean   :4.072           A      :14551         Mean   : 5.95  
##  3rd Qu.:5.000           D      :14274         3rd Qu.: 8.00  
##  Max.   :7.000           E      : 9795         Max.   :11.00  
##  NA's   :29084           (Other):12307         NA's   :29084  
##  ListingCategory..numeric. BorrowerState  
##  Min.   : 0.000            CA     :14717  
##  1st Qu.: 1.000            TX     : 6842  
##  Median : 1.000            NY     : 6729  
##  Mean   : 2.774            FL     : 6720  
##  3rd Qu.: 3.000            IL     : 5921  
##  Max.   :20.000                   : 5515  
##                            (Other):67493  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
##  Min.   :  0.00           False:56459         False:101218    
##  1st Qu.: 26.00           True :57478         True : 12719    
##  Median : 67.00                                               
##  Mean   : 96.07                                               
##  3rd Qu.:137.00                                               
##  Max.   :755.00                                               
##  NA's   :7625                                                 
##                     GroupKey                 DateCreditPulled 
##                         :100596   2013-12-23 09:38:12:     6  
##  783C3371218786870A73D20:  1140   2013-11-21 09:09:41:     4  
##  3D4D3366260257624AB272D:   916   2013-12-06 05:43:16:     4  
##  6A3B336601725506917317E:   698   2014-01-14 20:17:49:     4  
##  FEF83377364176536637E50:   611   2014-02-09 12:14:41:     4  
##  C9643379247860156A00EC0:   342   2013-09-27 22:04:54:     3  
##  (Other)                :  9634   (Other)            :113912  
##  CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.0         Min.   : 19.0        
##  1st Qu.:660.0         1st Qu.:679.0        
##  Median :680.0         Median :699.0        
##  Mean   :685.6         Mean   :704.6        
##  3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :880.0         Max.   :899.0        
##  NA's   :591           NA's   :591          
##         FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
##                     :   697     Min.   : 0.00      Min.   : 0.00  
##  1993-12-01 00:00:00:   185     1st Qu.: 7.00      1st Qu.: 6.00  
##  1994-11-01 00:00:00:   178     Median :10.00      Median : 9.00  
##  1995-11-01 00:00:00:   168     Mean   :10.32      Mean   : 9.26  
##  1990-04-01 00:00:00:   161     3rd Qu.:13.00      3rd Qu.:12.00  
##  1995-03-01 00:00:00:   159     Max.   :59.00      Max.   :54.00  
##  (Other)            :112389     NA's   :7604       NA's   :7604   
##  TotalCreditLinespast7years OpenRevolvingAccounts
##  Min.   :  2.00             Min.   : 0.00        
##  1st Qu.: 17.00             1st Qu.: 4.00        
##  Median : 25.00             Median : 6.00        
##  Mean   : 26.75             Mean   : 6.97        
##  3rd Qu.: 35.00             3rd Qu.: 9.00        
##  Max.   :136.00             Max.   :51.00        
##  NA's   :697                                     
##  OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries   
##  Min.   :    0.0             Min.   :  0.000      Min.   :  0.000  
##  1st Qu.:  114.0             1st Qu.:  0.000      1st Qu.:  2.000  
##  Median :  271.0             Median :  1.000      Median :  4.000  
##  Mean   :  398.3             Mean   :  1.435      Mean   :  5.584  
##  3rd Qu.:  525.0             3rd Qu.:  2.000      3rd Qu.:  7.000  
##  Max.   :14985.0             Max.   :105.000      Max.   :379.000  
##                              NA's   :697          NA's   :1159     
##  CurrentDelinquencies AmountDelinquent   DelinquenciesLast7Years
##  Min.   : 0.0000      Min.   :     0.0   Min.   : 0.000         
##  1st Qu.: 0.0000      1st Qu.:     0.0   1st Qu.: 0.000         
##  Median : 0.0000      Median :     0.0   Median : 0.000         
##  Mean   : 0.5921      Mean   :   984.5   Mean   : 4.155         
##  3rd Qu.: 0.0000      3rd Qu.:     0.0   3rd Qu.: 3.000         
##  Max.   :83.0000      Max.   :463881.0   Max.   :99.000         
##  NA's   :697          NA's   :7622       NA's   :990            
##  PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
##  Min.   : 0.0000          Min.   : 0.000            Min.   :      0       
##  1st Qu.: 0.0000          1st Qu.: 0.000            1st Qu.:   3121       
##  Median : 0.0000          Median : 0.000            Median :   8549       
##  Mean   : 0.3126          Mean   : 0.015            Mean   :  17599       
##  3rd Qu.: 0.0000          3rd Qu.: 0.000            3rd Qu.:  19521       
##  Max.   :38.0000          Max.   :20.000            Max.   :1435667       
##  NA's   :697              NA's   :7604              NA's   :7604          
##  BankcardUtilization AvailableBankcardCredit  TotalTrades    
##  Min.   :0.000       Min.   :     0          Min.   :  0.00  
##  1st Qu.:0.310       1st Qu.:   880          1st Qu.: 15.00  
##  Median :0.600       Median :  4100          Median : 22.00  
##  Mean   :0.561       Mean   : 11210          Mean   : 23.23  
##  3rd Qu.:0.840       3rd Qu.: 13180          3rd Qu.: 30.00  
##  Max.   :5.950       Max.   :646285          Max.   :126.00  
##  NA's   :7604        NA's   :7544            NA's   :7544    
##  TradesNeverDelinquent..percentage. TradesOpenedLast6Months
##  Min.   :0.000                      Min.   : 0.000         
##  1st Qu.:0.820                      1st Qu.: 0.000         
##  Median :0.940                      Median : 0.000         
##  Mean   :0.886                      Mean   : 0.802         
##  3rd Qu.:1.000                      3rd Qu.: 1.000         
##  Max.   :1.000                      Max.   :20.000         
##  NA's   :7544                       NA's   :7544           
##  DebtToIncomeRatio         IncomeRange    IncomeVerifiable
##  Min.   : 0.000    $25,000-49,999:32192   False:  8669    
##  1st Qu.: 0.140    $50,000-74,999:31050   True :105268    
##  Median : 0.220    $100,000+     :17337                   
##  Mean   : 0.276    $75,000-99,999:16916                   
##  3rd Qu.: 0.320    Not displayed : 7741                   
##  Max.   :10.010    $1-24,999     : 7274                   
##  NA's   :8554      (Other)       : 1427                   
##  StatedMonthlyIncome                    LoanKey       TotalProsperLoans
##  Min.   :      0     CB1B37030986463208432A1:     6   Min.   :0.00     
##  1st Qu.:   3200     2DEE3698211017519D7333F:     4   1st Qu.:1.00     
##  Median :   4667     9F4B37043517554537C364C:     4   Median :1.00     
##  Mean   :   5608     D895370150591392337ED6D:     4   Mean   :1.42     
##  3rd Qu.:   6825     E6FB37073953690388BC56D:     4   3rd Qu.:2.00     
##  Max.   :1750003     0D8F37036734373301ED419:     3   Max.   :8.00     
##                      (Other)                :113912   NA's   :91852    
##  TotalProsperPaymentsBilled OnTimeProsperPayments
##  Min.   :  0.00             Min.   :  0.00       
##  1st Qu.:  9.00             1st Qu.:  9.00       
##  Median : 16.00             Median : 15.00       
##  Mean   : 22.93             Mean   : 22.27       
##  3rd Qu.: 33.00             3rd Qu.: 32.00       
##  Max.   :141.00             Max.   :141.00       
##  NA's   :91852              NA's   :91852        
##  ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
##  Min.   : 0.00                       Min.   : 0.00                  
##  1st Qu.: 0.00                       1st Qu.: 0.00                  
##  Median : 0.00                       Median : 0.00                  
##  Mean   : 0.61                       Mean   : 0.05                  
##  3rd Qu.: 0.00                       3rd Qu.: 0.00                  
##  Max.   :42.00                       Max.   :21.00                  
##  NA's   :91852                       NA's   :91852                  
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding
##  Min.   :    0            Min.   :    0              
##  1st Qu.: 3500            1st Qu.:    0              
##  Median : 6000            Median : 1627              
##  Mean   : 8472            Mean   : 2930              
##  3rd Qu.:11000            3rd Qu.: 4127              
##  Max.   :72499            Max.   :23451              
##  NA's   :91852            NA's   :91852              
##  ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
##  Min.   :-209.00             Min.   :   0.0           
##  1st Qu.: -35.00             1st Qu.:   0.0           
##  Median :  -3.00             Median :   0.0           
##  Mean   :  -3.22             Mean   : 152.8           
##  3rd Qu.:  25.00             3rd Qu.:   0.0           
##  Max.   : 286.00             Max.   :2704.0           
##  NA's   :95009                                        
##  LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination   LoanNumber    
##  Min.   : 0.00                 Min.   :  0.0              Min.   :     1  
##  1st Qu.: 9.00                 1st Qu.:  6.0              1st Qu.: 37332  
##  Median :14.00                 Median : 21.0              Median : 68599  
##  Mean   :16.27                 Mean   : 31.9              Mean   : 69444  
##  3rd Qu.:22.00                 3rd Qu.: 65.0              3rd Qu.:101901  
##  Max.   :44.00                 Max.   :100.0              Max.   :136486  
##  NA's   :96985                                                            
##  LoanOriginalAmount          LoanOriginationDate LoanOriginationQuarter
##  Min.   : 1000      2014-01-22 00:00:00:   491   Q4 2013:14450         
##  1st Qu.: 4000      2013-11-13 00:00:00:   490   Q1 2014:12172         
##  Median : 6500      2014-02-19 00:00:00:   439   Q3 2013: 9180         
##  Mean   : 8337      2013-10-16 00:00:00:   434   Q2 2013: 7099         
##  3rd Qu.:12000      2014-01-28 00:00:00:   339   Q3 2012: 5632         
##  Max.   :35000      2013-09-24 00:00:00:   316   Q2 2012: 5061         
##                     (Other)            :111428   (Other):60343         
##                    MemberKey      MonthlyLoanPayment LP_CustomerPayments
##  63CA34120866140639431C9:     9   Min.   :   0.0     Min.   :   -2.35   
##  16083364744933457E57FB9:     8   1st Qu.: 131.6     1st Qu.: 1005.76   
##  3A2F3380477699707C81385:     8   Median : 217.7     Median : 2583.83   
##  4D9C3403302047712AD0CDD:     8   Mean   : 272.5     Mean   : 4183.08   
##  739C338135235294782AE75:     8   3rd Qu.: 371.6     3rd Qu.: 5548.40   
##  7E1733653050264822FAA3D:     8   Max.   :2251.5     Max.   :40702.39   
##  (Other)                :113888                                         
##  LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees   
##  Min.   :    0.0              Min.   :   -2.35   Min.   :-664.87  
##  1st Qu.:  500.9              1st Qu.:  274.87   1st Qu.: -73.18  
##  Median : 1587.5              Median :  700.84   Median : -34.44  
##  Mean   : 3105.5              Mean   : 1077.54   Mean   : -54.73  
##  3rd Qu.: 4000.0              3rd Qu.: 1458.54   3rd Qu.: -13.92  
##  Max.   :35000.0              Max.   :15617.03   Max.   :  32.06  
##                                                                   
##  LP_CollectionFees  LP_GrossPrincipalLoss LP_NetPrincipalLoss
##  Min.   :-9274.75   Min.   :  -94.2       Min.   : -954.5    
##  1st Qu.:    0.00   1st Qu.:    0.0       1st Qu.:    0.0    
##  Median :    0.00   Median :    0.0       Median :    0.0    
##  Mean   :  -14.24   Mean   :  700.4       Mean   :  681.4    
##  3rd Qu.:    0.00   3rd Qu.:    0.0       3rd Qu.:    0.0    
##  Max.   :    0.00   Max.   :25000.0       Max.   :25000.0    
##                                                              
##  LP_NonPrincipalRecoverypayments PercentFunded    Recommendations   
##  Min.   :    0.00                Min.   :0.7000   Min.   : 0.00000  
##  1st Qu.:    0.00                1st Qu.:1.0000   1st Qu.: 0.00000  
##  Median :    0.00                Median :1.0000   Median : 0.00000  
##  Mean   :   25.14                Mean   :0.9986   Mean   : 0.04803  
##  3rd Qu.:    0.00                3rd Qu.:1.0000   3rd Qu.: 0.00000  
##  Max.   :21117.90                Max.   :1.0125   Max.   :39.00000  
##                                                                     
##  InvestmentFromFriendsCount InvestmentFromFriendsAmount   Investors      
##  Min.   : 0.00000           Min.   :    0.00            Min.   :   1.00  
##  1st Qu.: 0.00000           1st Qu.:    0.00            1st Qu.:   2.00  
##  Median : 0.00000           Median :    0.00            Median :  44.00  
##  Mean   : 0.02346           Mean   :   16.55            Mean   :  80.48  
##  3rd Qu.: 0.00000           3rd Qu.:    0.00            3rd Qu.: 115.00  
##  Max.   :33.00000           Max.   :25000.00            Max.   :1189.00  
## 
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
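The three outputs above have the shape of `dim()`, `summary()`, and `str()` results. The calls themselves are hidden in the report, so this is a guess at how they could be reproduced, shown on a tiny made-up data frame named `toy` (hypothetical) rather than the real loan data. `stringsAsFactors = TRUE` is set explicitly because the output above shows text columns stored as factors.

```r
# Made-up rows that mimic three of the loan columns; not real data.
toy <- data.frame(
  Term        = c(36L, 36L, 60L),
  LoanStatus  = c("Completed", "Current", "Current"),
  BorrowerAPR = c(0.165, 0.120, NA),
  stringsAsFactors = TRUE
)

# Number of rows and columns.
print(dim(toy))

# Per-column summary: quartiles for numeric columns, level counts for factors.
print(summary(toy))

# Compact view of each column's type and first few values.
str(toy)
```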

The data set has 113,937 observations of 81 variables. Plotting a histogram for each variable would help in understanding the data, but with so many variables I need to narrow them down before beginning the univariate analysis.

Finding good variables

##                          ListingKey                       ListingNumber 
##                                   0                                   0 
##                 ListingCreationDate                         CreditGrade 
##                                   0                                   0 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                                   0                                  25 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                               29084                               29084 
##                     EstimatedReturn             ProsperRating..numeric. 
##                               29084                               29084 
##               ProsperRating..Alpha.                        ProsperScore 
##                                   0                               29084 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                   0 
##                          Occupation                    EmploymentStatus 
##                                   0                                   0 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                7625                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                                   0 
##                    DateCreditPulled               CreditScoreRangeLower 
##                                   0                                 591 
##               CreditScoreRangeUpper             FirstRecordedCreditLine 
##                                 591                                   0 
##                  CurrentCreditLines                     OpenCreditLines 
##                                7604                                7604 
##          TotalCreditLinespast7years               OpenRevolvingAccounts 
##                                 697                                   0 
##         OpenRevolvingMonthlyPayment                InquiriesLast6Months 
##                                   0                                 697 
##                      TotalInquiries                CurrentDelinquencies 
##                                1159                                 697 
##                    AmountDelinquent             DelinquenciesLast7Years 
##                                7622                                 990 
##            PublicRecordsLast10Years           PublicRecordsLast12Months 
##                                 697                                7604 
##              RevolvingCreditBalance                 BankcardUtilization 
##                                7604                                7604 
##             AvailableBankcardCredit                         TotalTrades 
##                                7544                                7544 
##  TradesNeverDelinquent..percentage.             TradesOpenedLast6Months 
##                                7544                                7544 
##                   DebtToIncomeRatio                         IncomeRange 
##                                8554                                   0 
##                    IncomeVerifiable                 StatedMonthlyIncome 
##                                   0                                   0 
##                             LoanKey                   TotalProsperLoans 
##                                   0                               91852 
##          TotalProsperPaymentsBilled               OnTimeProsperPayments 
##                               91852                               91852 
## ProsperPaymentsLessThanOneMonthLate     ProsperPaymentsOneMonthPlusLate 
##                               91852                               91852 
##            ProsperPrincipalBorrowed         ProsperPrincipalOutstanding 
##                               91852                               91852 
##         ScorexChangeAtTimeOfListing           LoanCurrentDaysDelinquent 
##                               95009                                   0 
##       LoanFirstDefaultedCycleNumber          LoanMonthsSinceOrigination 
##                               96985                                   0 
##                          LoanNumber                  LoanOriginalAmount 
##                                   0                                   0 
##                 LoanOriginationDate              LoanOriginationQuarter 
##                                   0                                   0 
##                           MemberKey                  MonthlyLoanPayment 
##                                   0                                   0 
##                 LP_CustomerPayments        LP_CustomerPrincipalPayments 
##                                   0                                   0 
##                  LP_InterestandFees                      LP_ServiceFees 
##                                   0                                   0 
##                   LP_CollectionFees               LP_GrossPrincipalLoss 
##                                   0                                   0 
##                 LP_NetPrincipalLoss     LP_NonPrincipalRecoverypayments 
##                                   0                                   0 
##                       PercentFunded                     Recommendations 
##                                   0                                   0 
##          InvestmentFromFriendsCount         InvestmentFromFriendsAmount 
##                                   0                                   0 
##                           Investors 
##                                   0

This result shows the number of NAs for each variable. This will help me filter out variables with too many NAs.
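A per-variable NA count like the one above can be produced in base R. The following is a minimal sketch on a made-up three-column data frame; the name `loans` and the 50% cut-off are assumptions for illustration, not values taken from this analysis.

```r
# Toy data frame standing in for the loan data (all values are made up).
loans <- data.frame(
  Term              = c(12, 36, 60, 36),
  DebtToIncomeRatio = c(0.2, NA, 0.3, NA),
  TotalProsperLoans = c(NA, NA, NA, 1)
)

# Count missing values in each column.
na_counts <- colSums(is.na(loans))
print(na_counts)

# Keep only columns where at most half of the rows are missing.
# The 50% threshold is an arbitrary choice for this sketch.
keep <- names(na_counts)[na_counts / nrow(loans) <= 0.5]
loans_reduced <- loans[, keep, drop = FALSE]
print(names(loans_reduced))
```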

Term

## 
##    12    36    60 
##  1614 87778 24545

Borrowers use only 3 terms: 12, 36 and 60. It is better to adjust the plot to show only these 3 values on the x-axis.

Now the plot shows only 3 values on the x-axis: 12, 36 and 60. The Term variable will be used as a factor.
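Treating Term as a factor can be sketched as follows. The vector `term` holds made-up values; only the three levels come from the table above.

```r
# Made-up Term values; the real column has 12, 36 and 60 as its only values.
term <- c(36, 12, 60, 36, 36, 60)

# Convert to a factor with an explicit level order so that tables and
# discrete plot axes show exactly these three categories.
term_f <- factor(term, levels = c(12, 36, 60))
print(table(term_f))
```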

Borrower Rate, Borrower APR, Lender Yield, EstimatedEffectiveYield

The graphs are similar to each other: each is right skewed with an abnormal peak at the right. I will only use Borrower Rate for convenience.

Borrower Rate

Apart from the high peak around .36, the histogram is right skewed. According to the graph, there is a very popular rate near .36. Let’s find its exact value.

## Source: local data frame [2,294 x 2]
## 
##    BorrowerRate Count
##           (dbl) (int)
## 1        0.3177  3672
## 2        0.3500  1905
## 3        0.3199  1651
## 4        0.2900  1508
## 5        0.2699  1319
## 6        0.1500  1182
## 7        0.1400  1035
## 8        0.1099   949
## 9        0.2000   907
## 10       0.1585   806
## ..          ...   ...

0.3177 is the most common rate, with 3,672 observations. Cutting the variable into intervals and using it as a factor might be more useful.

## 
## (-0.000498,0.0498]    (0.0498,0.0995]     (0.0995,0.149] 
##                 63              12013              25613 
##      (0.149,0.199]      (0.199,0.249]      (0.249,0.298] 
##              27088              18633              17518 
##      (0.298,0.348]      (0.348,0.398]      (0.398,0.448] 
##              10951               2050                  2 
##      (0.448,0.498] 
##                  6
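Both steps, finding the most frequent rate and cutting the rates into intervals, can be sketched in base R. The `rates` vector is made up, and `breaks = 10` is an assumption that happens to be consistent with the ten equal-width buckets shown above.

```r
# Made-up borrower rates (not the real data).
rates <- c(0.3177, 0.35, 0.3177, 0.15, 0.0, 0.4975, 0.3177, 0.35)

# Frequency of each distinct rate, most common first.
rate_counts <- sort(table(rates), decreasing = TRUE)
most_common <- as.numeric(names(rate_counts)[1])
print(most_common)

# Ten equal-width buckets. cut() widens the outer limits slightly so that
# the minimum and maximum values both fall inside a bucket.
rate_bucket <- cut(rates, breaks = 10)
print(table(rate_bucket))
```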

Estimated Return

This looks like a right skewed graph with outliers below 0.0. The graph with more bins shows clear outliers, both negative values and values above 0.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084
## 
##  (-0.1,0]   (0,0.2] (0.2,0.3] 
##       176     84577        80

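There are 176 observations in (-0.1, 0] and 80 observations above 0.2. The three bins cover 84,833 of the 84,853 non-missing values, so the 20 observations at or below -0.1 (the minimum is -0.183) are not counted in this table.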

The graph is drawn after filtering Estimated Return to values greater than 0.01 and less than 0.2. Similar to the previous variables, there is a peak at the right. After applying the log10 scale, a bimodal histogram can be seen.

Prosper Score

Prosper score is factored.

## 
##     1     2     3     4     5     6     7     8     9    10    11 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456

ProsperScore ranges from 1 to 11. The table covers 84,853 observations, so 29,084 of the 113,937 observations (about 26%) are NA. We can assume these borrowers did not provide enough information for a ProsperScore.

Employment Status Duration

This is a clean right skewed graph without a peak at the right. A square root scale is applied to show the right skewed shape more clearly.

CreditScoreRangeLower and CreditScoreRangeUpper

These two histograms are very similar. Instead of using these 2 variables, I will create a new variable ‘CreditScoreRangeMid’ that is the average of these 2.
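The averaging step can be written as below. The column names match the dataset, while the three rows are made-up values.

```r
# Made-up credit score ranges (lower and upper bound per borrower).
scores <- data.frame(
  CreditScoreRangeLower = c(640, 680, 720),
  CreditScoreRangeUpper = c(659, 699, 739)
)

# Midpoint of the range: the average of the lower and upper bounds.
scores$CreditScoreRangeMid <-
  (scores$CreditScoreRangeLower + scores$CreditScoreRangeUpper) / 2
print(scores)
```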

CurrentCreditLines, OpenCreditLines, TotalCreditLinespast7years

These are nice looking right skewed histograms.

TotalTrades

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   15.00   22.00   23.23   30.00  126.00    7544

DebtToIncomeRatio

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

There is an outlier at about 10.0 (the maximum is 10.01), so the outlier is removed and the data is plotted again. The graph above shows the Debt to Income Ratio values that are less than or equal to 1.

StatedMonthlyIncome

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

The summary of StatedMonthlyIncome shows a huge difference between the 3rd quartile (6825) and the maximum value (1750000). This means that there are large outliers in the data.

The histogram of StatedMonthlyIncome looks right skewed, but it is hard to read because the counts in the long tail are sparse and do not form a continuous shape.

##      n
## 1 1140

After applying log10 to the y-axis, the graph now looks better. There are 1140 people with monthly income greater than 20526.67. Most people’s income is around 4667, the median.

MonthlyLoanPayment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0

New Variable: ratio of monthly loan payment to monthly income

A new variable, the ratio of monthly loan payment to monthly income, is created. This is different from the DebtToIncomeRatio variable because it is a monthly ratio. The graph is hard to read since there are many outliers.

##    (-0.01,0.5] (0.5,1.26e+04]           NA's 
##         113355            581              1

There are only 581 observations that are larger than 0.5.

The above histogram is plotted with subset of the data set where ratio_monthly_loan_payment is less than 0.5. The right skewed graph can be seen clearly.
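Creating and filtering the ratio can be sketched as follows. The five rows are made up, and dropping non-finite ratios is an assumption added here because an income of zero produces `NaN` or `Inf`.

```r
# Made-up payments and incomes, including two zero-income rows.
loans <- data.frame(
  MonthlyLoanPayment  = c(200, 150, 300, 0, 500),
  StatedMonthlyIncome = c(4000, 3000, 500, 0, 0)
)

# Ratio of monthly loan payment to monthly income.
# 0 / 0 gives NaN and 500 / 0 gives Inf in R.
loans$ratio_monthly_loan_payment <-
  loans$MonthlyLoanPayment / loans$StatedMonthlyIncome

# Keep finite ratios below 0.5 for plotting.
ok <- is.finite(loans$ratio_monthly_loan_payment) &
  loans$ratio_monthly_loan_payment < 0.5
filtered_ratio <- loans[ok, ]
print(filtered_ratio)
```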

Original Loan Amount

Univariate Analysis

What is the structure of your dataset?

There are 113937 observations and 86 variables. Many of them are factor variables; below is a list of the factor variables I will use for further analysis.

CreditGrade, Term, LoanStatus, ProsperScore, BorrowerState, Occupation, EmploymentStatus, IncomeRange

What is/are the main feature(s) of interest in your dataset?

I am interested in finding which variables affect the credit score. I will also check whether there is any correlation between income-related variables and Loan Status.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

People have different interest rates, monthly income, monthly loan payment, etc. These variables might affect their capabilities of repaying their loans and their credit scores and prosper scores. And depending on their employment status, occupations and credit scores, their loan amount may be different. For instance, I am expecting that people with low credit scores will have higher interest rates.

Did you create any new variables from existing variables in the dataset?

I created several new variables: BorrowerRate.bucket, EstimatedReturn.bucket, CreditScoreRangeMid and ratio monthly loan payment. BorrowerRate.bucket and EstimatedReturn.bucket are divided into intervals so that they can be used as factor variables. CreditScoreRangeMid is the average of CreditScoreRangeLower and CreditScoreRangeUpper. Ratio monthly loan payment is the ratio of monthly loan payment to monthly income.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

After plotting histograms of many variables, a consistent pattern can be found: some of them have an unusually high peak to the right of the median. I cut such variables into distinct intervals so that I can use them in the multivariate analysis.

Bivariate Plots Section

## [1] "LoanStatus"          "BorrowerRate"        "EstimatedReturn"    
## [4] "ProsperScore"        "StatedMonthlyIncome" "LoanOriginalAmount" 
## [7] "MonthlyLoanPayment"  "CreditScoreRangeMid"

According to the matrix, BorrowerRate and EstimatedReturn are positively correlated. There is also something interesting going on between ProsperScore and BorrowerRate. There is a strong positive correlation between LoanOriginalAmount and MonthlyLoanPayment.

## 
##  Pearson's product-moment correlation
## 
## data:  tmp$BorrowerRate and tmp$CreditScoreRangeMid
## t = -188.08, df = 113210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4923486 -0.4834719
## sample estimates:
##        cor 
## -0.4879229

As expected, Credit Score and BorrowerRate are negatively correlated. Outliers are removed. The correlation test shows a moderate negative correlation (r = -0.488).
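A Pearson correlation test of this kind can be run with `cor.test` from base R. The two vectors below are made up to show a negative relationship; they are not drawn from the loan data.

```r
# Made-up credit scores and borrower rates for six borrowers.
credit_score  <- c(600, 640, 680, 720, 760, 800)
borrower_rate <- c(0.32, 0.28, 0.25, 0.18, 0.15, 0.09)

# Pearson's product-moment correlation with a 95% confidence interval.
res <- cor.test(borrower_rate, credit_score, method = "pearson")
print(res$estimate)   # sample correlation
print(res$conf.int)   # 95 percent confidence interval
print(res$parameter)  # degrees of freedom, n - 2
```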

NA values in ProsperScore can be ignored. A clear pattern can be observed in this box plot: as the score gets higher, the borrower rate gets lower.

A clear linear pattern can be observed in the graph. Interestingly, 3 distinct lines can be seen.

## 
##  Pearson's product-moment correlation
## 
## data:  df$MonthlyLoanPayment and df$LoanOriginalAmount
## t = 867.82, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9312165 0.9327426
## sample estimates:
##       cor 
## 0.9319837

As a result, these two variables have a very high correlation (r = 0.932, so R^2 is about 0.87). This is reasonable because those with larger loan amounts have to pay more.
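For a regression with a single predictor, R^2 equals the square of the Pearson correlation. The sketch below checks this on made-up loan amounts and payments, then squares the correlation reported above.

```r
# Made-up loan amounts and monthly payments.
amount  <- c(1000, 4000, 10000, 15000, 25000)
payment <- c(40, 130, 330, 520, 800)

# Pearson correlation and the R^2 of the one-predictor linear model.
r  <- cor(payment, amount)
r2 <- summary(lm(payment ~ amount))$r.squared
print(all.equal(r^2, r2))

# Squaring the correlation from the test output above.
print(0.9319837^2)
```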

Ratio of monthly loan payment to monthly income

Due to huge outliers in the ratio monthly loan payment variable, the graphs in the matrix don’t provide good information. Outliers are removed, alpha is set to 0.1 and the data is subsetted to get a better graph.

The graph of ratio monthly loan payment against EstimatedReturn shows no obvious pattern. The ratio tends to be slightly higher at an EstimatedReturn of 0.10, but the trend is too subtle to rely on.

## df$Term: 12
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##   0.00000   0.03044   0.05301   0.30700   0.09036 192.30000 
## -------------------------------------------------------- 
## df$Term: 36
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     0.025     0.045    10.420     0.077 12570.000 
## -------------------------------------------------------- 
## df$Term: 60
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0374   0.0566   0.1216   0.0768 662.6000
## Source: local data frame [3 x 2]
## 
##     Term     n
##   (fctr) (int)
## 1     12  1608
## 2     36 87387
## 3     60 24521

Although the median is the largest when the Term is 60, the mean is the largest when the Term is 36. The Term 36 mean (10.42) is driven by extreme outliers; its maximum is 12570.
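The gap between mean and median can be reproduced with a grouped summary. This sketch uses nine made-up ratios with one extreme value in the Term 36 group; `by` and `tapply` are base R.

```r
# Made-up ratios; one extreme outlier sits in the Term 36 group.
ratio <- c(0.03, 0.05, 0.09, 0.02, 0.045, 12570, 0.04, 0.057, 0.08)
term  <- factor(c(12, 12, 12, 36, 36, 36, 60, 60, 60))

# Summary statistics per Term.
print(by(ratio, term, summary))

# The median resists the outlier, the mean does not.
medians <- tapply(ratio, term, median)
means   <- tapply(ratio, term, mean)
print(names(medians)[which.max(medians)])  # Term with the largest median
print(names(means)[which.max(means)])      # Term with the largest mean
```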

Loans with Completed status tend to have a larger value of ratio monthly loan payment, and loans with Defaulted status have a low ratio.

## df$LoanStatus: Cancelled
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2445    2600    2609    3833    4167 
## -------------------------------------------------------- 
## df$LoanStatus: Chargedoff
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2500    3750    4486    5500  208300 
## -------------------------------------------------------- 
## df$LoanStatus: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2917    4417    5325    6583  618500 
## -------------------------------------------------------- 
## df$LoanStatus: Current
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3667    5167    6153    7447 1750000 
## -------------------------------------------------------- 
## df$LoanStatus: Defaulted
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2500    3708    4367    5417   58620 
## -------------------------------------------------------- 
## df$LoanStatus: FinalPaymentInProgress
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1167    3583    5250    6312    8333   32920 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (>120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3115    3750    3727    4500    6667 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (1-15 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3167    4667    5554    6948   35420 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (16-30 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3250    4583    5484    6500   30000 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (31-60 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2938    4583    5436    7083   25000 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (61-90 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3167    4583    5323    6594   31250 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (91-120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3073    4171    4816    5833   22920

Borrowers with Completed loans have greater stated incomes (median 4417) than borrowers with Defaulted loans (median 3708).

## df$LoanStatus: Cancelled
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1075  0.1395  0.2000  0.1844  0.2375  0.2375 
## -------------------------------------------------------- 
## df$LoanStatus: Chargedoff
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1769  0.2400  0.2354  0.2975  0.4500 
## -------------------------------------------------------- 
## df$LoanStatus: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1173  0.1744  0.1864  0.2511  0.4975 
## -------------------------------------------------------- 
## df$LoanStatus: Current
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0577  0.1314  0.1760  0.1838  0.2310  0.3304 
## -------------------------------------------------------- 
## df$LoanStatus: Defaulted
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1650  0.2296  0.2231  0.2875  0.4975 
## -------------------------------------------------------- 
## df$LoanStatus: FinalPaymentInProgress
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0629  0.1299  0.1899  0.1970  0.2712  0.3199 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (>120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1449  0.2079  0.2551  0.2527  0.3060  0.3199 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (1-15 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0749  0.1870  0.2317  0.2308  0.2859  0.3435 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (16-30 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0599  0.1899  0.2419  0.2353  0.2909  0.3304 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (31-60 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0649  0.1855  0.2468  0.2330  0.2870  0.3304 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (61-90 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0659  0.1914  0.2468  0.2400  0.2999  0.3304 
## -------------------------------------------------------- 
## df$LoanStatus: Past Due (91-120 days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0766  0.1850  0.2495  0.2383  0.2952  0.3435

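People with lower borrower rates tend to complete their loan payments more than those with higher rates: the median rate is 0.1744 for Completed loans, versus 0.2296 for Defaulted and 0.2400 for Chargedoff loans.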

## filtered_ratio$ProsperScore: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02088 0.03460 0.04419 0.05513 0.34490 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02549 0.04036 0.04934 0.06350 0.44350 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03029 0.05062 0.05840 0.07790 0.48680 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03484 0.05790 0.06302 0.08399 0.46500 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03231 0.05468 0.06244 0.08163 0.49300 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03390 0.05604 0.06336 0.08301 0.47510 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03322 0.05505 0.06098 0.08076 0.45510 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.03074 0.05130 0.05757 0.07677 0.47470 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02521 0.04286 0.04827 0.06629 0.42360 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02329 0.04147 0.04665 0.06552 0.42300 
## -------------------------------------------------------- 
## filtered_ratio$ProsperScore: 11
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002689 0.031750 0.051300 0.053250 0.072210 0.122800

ProsperScores in the middle of the range, around 5 and 6, tend to have the highest ratio (medians of roughly 0.055 to 0.058 for scores 4 through 7), and the ratio gets lower as the score moves away from the middle.

People with a credit score around 700 tend to have the highest ratio.

## 
##    19   379   439   459   479   499   519   539   559   579   599   619 
##   133     1     5    36   141   346   554  1593  1474  1357  1125  3602 
##   639   659   679   699   719   739   759   779   799   819   839   859 
##  4172 12199 16366 16492 15471 12923  9267  6606  4624  2644  1409   567 
##   879   899 
##   212    27

Interestingly, the credit score increases with the prosper score up to a prosper score of 10 and decreases when the prosper score is 11. People with very good credit scores have a prosper score of 10.

This is a graph of Occupation and Loan amount. There are too many occupations, so they would have to be grouped before a clearer pattern could be observed. I decided not to use Occupation this time.

## 
## Call:
## lm(formula = LoanOriginalAmount ~ CreditScoreRangeMid, data = subset(df, 
##     CreditScoreRangeMid > 350))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14210  -4136  -1306   3157  25449 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.625e+04  1.953e+02  -83.23   <2e-16 ***
## CreditScoreRangeMid  3.537e+01  2.795e-01  126.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5850 on 113211 degrees of freedom
## Multiple R-squared:  0.1239, Adjusted R-squared:  0.1239 
## F-statistic: 1.602e+04 on 1 and 113211 DF,  p-value: < 2.2e-16

R^2 is very small (0.124), so it is difficult to say there is a linear relationship between these two variables, but generally people with good credit scores have greater loan amounts.

## 
## Call:
## lm(formula = log(filter_monthlyIncome$StatedMonthlyIncome) ~ 
##     filter_monthlyIncome$CreditScoreRangeMid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1391 -0.3450  0.0179  0.3694  5.8006 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                              6.882e+00  1.914e-02  359.60
## filter_monthlyIncome$CreditScoreRangeMid 2.258e-03  2.741e-05   82.37
##                                          Pr(>|t|)    
## (Intercept)                                <2e-16 ***
## filter_monthlyIncome$CreditScoreRangeMid   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6073 on 111650 degrees of freedom
##   (578 observations deleted due to missingness)
## Multiple R-squared:  0.05728,    Adjusted R-squared:  0.05727 
## F-statistic:  6784 on 1 and 111650 DF,  p-value: < 2.2e-16
## 
##  Pearson's product-moment correlation
## 
## data:  filter_monthlyIncome$StatedMonthlyIncome and filter_monthlyIncome$CreditScoreRangeMid
## t = 36.827, df = 111650, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1037514 0.1153419
## sample estimates:
##       cor 
## 0.1095504
## 
##  Pearson's product-moment correlation
## 
## data:  log(filter_monthlyIncome$StatedMonthlyIncome) and filter_monthlyIncome$CreditScoreRangeMid
## t = 82.366, df = 111650, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2337985 0.2448578
## sample estimates:
##       cor 
## 0.2393359

CreditScoreRangeMid vs. StatedMonthlyIncome was first graphed without any modification, and then the log of monthly income was graphed. The latter showed a more linear graph and the correlation was higher (0.239 versus 0.110).
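The effect of the log transform on the correlation can be illustrated with made-up data in which income grows exponentially with credit score. Zero incomes are dropped first, because `log(0)` is `-Inf`; the coefficients 2 and 0.01 are arbitrary.

```r
# Made-up scores and incomes; the last borrower reports zero income.
score  <- c(600, 640, 680, 720, 760, 800, 700)
income <- c(exp(2 + 0.01 * score[1:6]), 0)

# Drop zero incomes before taking logs.
keep <- income > 0

# Correlation on the raw scale and on the log scale.
cor_raw <- cor(income[keep], score[keep])
cor_log <- cor(log(income[keep]), score[keep])
print(c(raw = cor_raw, log = cor_log))
```

In this constructed case the log-scale relationship is exactly linear, so the log correlation is 1 while the raw correlation is lower.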

## 
## Call:
## lm(formula = LoanOriginalAmount ~ StatedMonthlyIncome, data = subset(df, 
##     StatedMonthlyIncome > 10))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -294921   -4585   -1890    3305   26194 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.418e+03  2.294e+01   323.4   <2e-16 ***
## StatedMonthlyIncome 1.666e-01  2.435e-03    68.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6121 on 112228 degrees of freedom
## Multiple R-squared:  0.04002,    Adjusted R-squared:  0.04001 
## F-statistic:  4678 on 1 and 112228 DF,  p-value: < 2.2e-16
## 
##  Pearson's product-moment correlation
## 
## data:  tmp$LoanOriginalAmount and log(tmp$StatedMonthlyIncome)
## t = 155.04, df = 112230, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4151712 0.4248083
## sample estimates:
##       cor 
## 0.4200016

There is a moderate linear relationship between the log of monthly income and loan amount (r = 0.42). There are clear horizontal lines at multiples of 5000. I believe the loan amounts are rounded when the data is gathered. People whose monthly income is less than 8000 don’t usually have loan amounts greater than 25,000.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the plots above, people with lower borrower rates and higher incomes tend to complete their loan payments better than those with higher borrower rates and lower incomes.

I created a new variable, the ratio of monthly loan payment to monthly income. I was interested in this ratio because it can be an important factor when people repay their loans. Students can have lower incomes if they don’t get the jobs they expected after graduation. In this case, a large portion of their income has to go toward their loan payments. I thought that I could get a slight idea of unemployment rates from this analysis.

As a result, people who completed repaying their loans have a lower rate than people who defaulted on their loans. And people with the highest Prosper Score and Credit Score have a lower ratio.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I thought that when the borrower rate is high the return rate would be lower, but the opposite can be seen in the graph: the two are positively correlated.

I also thought that the ratio of monthly loan payment to monthly income would be greater for those with better scores. However, the ratio is the greatest when the ProsperScore is 5 or 6. Then I analyzed the relationship between the credit score and the monthly income: those with greater incomes have better credit scores. This explains why the ratio at 5 or 6 is greater than at higher scores: people with a good income don’t increase their loan payment amount.

What was the strongest relationship you found?

The strongest relationship I found was between Monthly Loan Payment and Loan Amount. The Borrower Rate and Prosper Score also showed a strong relationship.

Multivariate Plots Section

I plotted two similar plots: one with low ProsperScores and the other with high ProsperScores. This is because the sequential scale color brewer palettes provide at most 9 colors, so the 11 scores cannot be colored in one plot.

## 
##   (0,2]   (2,4]   (4,6]   (6,8]  (8,10] (10,12] 
##    6758   20237   22091   22650   11661    1456

I cut ProsperScore into 6 intervals in order to use scale color brewer.
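Intervals like the ones above can be produced with `cut` and explicit breaks. The scores below are made up, and `seq(0, 12, by = 2)` is an assumption that matches the interval labels in the table.

```r
# Made-up ProsperScore values, including one missing score.
prosper_score <- c(1, 2, 3, 5, 6, 8, 10, 11, NA)

# Six right-closed intervals of width 2. Six levels fit within the
# 9-color limit of the sequential ColorBrewer palettes.
score_bucket <- cut(prosper_score, breaks = seq(0, 12, by = 2))
print(table(score_bucket))
```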

I plotted ratio monthly loan payment against credit score for each BorrowerRate bucket. Good prosper scores can be found in the borrower rate range of 0.0571 to 0.108. These borrowers also have good credit scores. As the borrower rate increases, the prosper score and credit score decrease.

The line with the largest slope mainly contains low prosper scores. Note that those with ProsperScore of 11 usually are on the lines that have slope of 1/30 or 1/50.

The graph is faceted by BorrowerRate, and a more obvious pattern can be observed. People with good ProsperScores are in the low borrower rate ranges and those with bad scores are in the high rate ranges: as the borrower rate gets larger, the score gets lower. People with higher borrower rates also don’t have large loan amounts. In the BorrowerRate buckets from 0.298 to 0.398, the loan amounts don’t exceed 20,000.

From the previous plot, the positive correlation between CreditScore and ProsperScore has already been observed. What’s interesting to note in this graph is that those with a good ProsperScore don’t usually pay more than 20% of their income toward their loan payments.

Chargedoff, Completed and Defaulted are similar: most of these borrowers are Full-time. Interestingly, most of the people who are currently paying their loans are self-employed.

## 
##        (0,1e+03]    (1e+03,2e+03]    (2e+03,3e+03]    (3e+03,4e+03] 
##             1984             6362            16141            18155 
##    (4e+03,5e+03]    (5e+03,6e+03]    (6e+03,7e+03]    (7e+03,8e+03] 
##            19727            12859            10361             7750 
## (8e+03,1.75e+06] 
##            19203

StatedMonthlyIncome is cut into 9 intervals.
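Unequal intervals with one wide top bin can be built by passing a custom break vector to `cut`. The incomes below are made up, and the break vector is an assumption inferred from the interval labels above. Note that with the default right-closed intervals an income of exactly 0 falls outside the first bin and becomes NA.

```r
# Made-up monthly incomes.
income <- c(0, 900, 1000, 2500, 7999, 8000, 8001, 1750000)

# Eight bins of width 1000 plus one wide bin up to the maximum.
breaks <- c(seq(0, 8000, by = 1000), 1750000)
income_bucket <- cut(income, breaks = breaks)

print(table(income_bucket, useNA = "ifany"))
```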

This graph also shows that people with high income levels have high Prosper Scores and low Borrower Rates.

## Source: local data frame [68 x 3]
## 
##                    Occupation count CreditScoreRangeMid_mean
##                        (fctr) (int)                    (dbl)
## 1                       Judge    22                 734.9545
## 2                  Pharmacist   257                 729.6556
## 3                      Doctor   494                 727.1113
## 4                    Investor   214                 724.5467
## 5                    Attorney  1046                 718.7161
## 6  Pilot - Private/Commercial   199                 718.6457
## 7                     Dentist    68                 718.6176
## 8                   Professor   557                 717.4354
## 9                   Principal   312                 716.5513
## 10      Engineer - Electrical  1125                 716.3800
## ..                        ...   ...                      ...
## Source: local data frame [68 x 3]
## 
##                     Occupation count CreditScoreRangeMid_mean
##                         (fctr) (int)                    (dbl)
## 1  Student - College Sophomore    69                 628.3406
## 2   Student - Technical School    16                 640.7500
## 3  Student - Community College    28                 641.6429
## 4   Student - College Freshman    41                 645.1098
## 5     Student - College Junior   112                 651.6429
## 6                    Homemaker   120                 664.3333
## 7                     Clerical  3164                 669.5759
## 8              Waiter/Waitress   436                 671.1055
## 9     Student - College Senior   188                 672.6915
## 10                     Laborer  1595                 678.8542
## ..                         ...   ...                      ...

There are 68 different occupations in the data set. According to the tables above, the counts for some occupations are less than 1000. I will filter out occupations whose count is less than 1000.
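The count-and-filter step can be sketched in base R as follows. The six rows are made up, and `min_count` is set to 2 only so that the toy data shows the effect; the analysis itself uses 1000.

```r
# Made-up occupations and credit scores.
loans <- data.frame(
  Occupation = c("Judge", "Clerical", "Clerical", "Clerical", "Teacher", "Teacher"),
  CreditScoreRangeMid = c(735, 660, 670, 680, 700, 710),
  stringsAsFactors = FALSE
)
min_count <- 2  # hypothetical threshold for this toy example

# Count and mean credit score per occupation.
counts <- table(loans$Occupation)
means  <- tapply(loans$CreditScoreRangeMid, loans$Occupation, mean)
by_occupation <- data.frame(
  Occupation = names(counts),
  count = as.vector(counts),
  CreditScoreRangeMid_mean = as.vector(means[names(counts)]),
  stringsAsFactors = FALSE
)

# Drop rare occupations, then sort by mean credit score, highest first.
by_occupation <- by_occupation[by_occupation$count >= min_count, ]
by_occupation <- by_occupation[order(-by_occupation$CreditScoreRangeMid_mean), ]
print(by_occupation)
```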

## Source: local data frame [27 x 3]
## 
##                           Occupation count CreditScoreRangeMid_mean
##                               (fctr) (int)                    (dbl)
## 1                           Attorney  1046                 718.7161
## 2              Engineer - Electrical  1125                 716.3800
## 3                          Executive  4311                 712.3810
## 4              Engineer - Mechanical  1406                 709.8841
## 5                Computer Programmer  4478                 708.2182
## 6                         Nurse (RN)  2489                 707.2340
## 7                     Accountant/CPA  3233                 705.3862
## 8  Police Officer/Correction Officer  1578                 700.9702
## 9                            Teacher  3759                 699.7580
## 10                      Professional 13628                 699.5719
## ..                               ...   ...                      ...
## Source: local data frame [27 x 3]
## 
##                  Occupation count CreditScoreRangeMid_mean
##                      (fctr) (int)                    (dbl)
## 1                  Clerical  3164                 669.5759
## 2                   Laborer  1595                 678.8542
## 3              Food Service  1123                 679.4733
## 4            Sales - Retail  2797                 681.7703
## 5         Military Enlisted  1272                 681.8585
## 6  Administrative Assistant  3688                 683.1388
## 7         Retail Management  2602                 689.6076
## 8              Truck Driver  1675                 690.6343
## 9        Sales - Commission  3446                 690.6782
## 10            Skilled Labor  2746                 690.8474
## ..                      ...   ...                      ...

This list has 4 occupations with the highest credit scores and 4 occupations with the lowest credit scores for the further analysis.

ProsperScore vs. BorrowerRate is plotted with the points colored by Occupation. Unfortunately, too many points are NA, so this plot cannot be used for the analysis.

Full-time borrowers cluster in the bottom-right corner and self-employed borrowers on the left-hand side. Those with ProsperScore 11 are mostly self-employed.

Linear Model

Based on the observations so far, I can build a model using LoanOriginalAmount, MonthlyLoanPayment, ProsperScore, BorrowerRate, EmploymentStatus and CreditScore.

## 
## Calls:
## m1: lm(formula = ProsperRating..numeric. ~ BorrowerRate, data = df_reduced)
## m2: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount, 
##     data = df_reduced)
## m3: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount + 
##     log(MonthlyLoanPayment), data = df_reduced)
## m4: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount + 
##     log(MonthlyLoanPayment) + CreditScoreRangeMid, data = df_reduced)
## m5: lm(formula = ProsperRating..numeric. ~ BorrowerRate + LoanOriginalAmount + 
##     log(MonthlyLoanPayment) + CreditScoreRangeMid + EmploymentStatus, 
##     data = df_reduced)
## 
## ========================================================================================================
##                                                 m1          m2          m3          m4          m5      
## --------------------------------------------------------------------------------------------------------
##   (Intercept)                                 8.267***    8.094***    8.632***    6.374***    6.234***  
##                                              (0.005)     (0.007)     (0.026)     (0.041)     (0.041)    
##   BorrowerRate                              -21.394***  -21.014***  -20.946***  -20.073***  -20.017***  
##                                              (0.023)     (0.025)     (0.026)     (0.028)     (0.028)    
##   LoanOriginalAmount                                      0.000***    0.000***    0.000***    0.000***  
##                                                          (0.000)     (0.000)     (0.000)     (0.000)    
##   log(MonthlyLoanPayment)                                            -0.119***   -0.114***   -0.095***  
##                                                                      (0.006)     (0.005)     (0.005)    
##   CreditScoreRangeMid                                                             0.003***    0.003***  
##                                                                                  (0.000)     (0.000)    
##   EmploymentStatus: Full-time/Employed                                                        0.111***  
##                                                                                              (0.006)    
##   EmploymentStatus: Not employed/Employed                                                    -0.027     
##                                                                                              (0.019)    
##   EmploymentStatus: Other/Employed                                                           -0.085***  
##                                                                                              (0.008)    
##   EmploymentStatus: Part-time/Employed                                                        0.080*    
##                                                                                              (0.031)    
##   EmploymentStatus: Retired/Employed                                                          0.072**   
##                                                                                              (0.026)    
##   EmploymentStatus: Self-employed/Employed                                                   -0.132***  
##                                                                                              (0.007)    
## --------------------------------------------------------------------------------------------------------
##   R-squared                                      0.909       0.910       0.911       0.916       0.917  
##   adj. R-squared                                 0.909       0.910       0.911       0.916       0.917  
##   sigma                                          0.505       0.501       0.500       0.485       0.483  
##   F                                         841017.986  427508.345  286711.241  229104.529   92654.177  
##   p                                              0.000       0.000       0.000       0.000       0.000  
##   Log-likelihood                            -62027.080  -61393.349  -61163.761  -58721.850  -58294.576  
##   Deviance                                   21493.951   21173.415   21058.475   19873.908   19673.596  
##   AIC                                       124060.161  122794.699  122337.522  117455.699  116613.151  
##   BIC                                       124088.189  122832.070  122384.236  117511.756  116725.265  
##   N                                          84356       84356       84356       84356       84356      
## ========================================================================================================

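R-squared increases slightly as variables are added, from 0.909 for m1 to 0.917 for m5. An R-squared of 0.917 means the final model explains about 91.7% of the variance in ProsperRating..numeric. It does not by itself indicate the direction of a relationship; that comes from the signs of the coefficients, such as the negative coefficient on BorrowerRate. Most of the explained variance is already captured by BorrowerRate alone in m1.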
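As a rough sketch of how nested models like m1 to m5 can be built and compared, the example below fits two nested linear models on synthetic data. The `toy` data frame and its coefficients are made up for illustration and only loosely mimic `df_reduced`; the point is that R-squared cannot decrease when a predictor is added to a least-squares model, so adjusted R-squared or an F-test is a better guide to whether the extra variable helps.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for df_reduced; all values are made up.
toy <- data.frame(
  BorrowerRate = runif(n, 0.05, 0.35),
  CreditScoreRangeMid = round(rnorm(n, 700, 40))
)
toy$ProsperRating <- 8.2 - 21 * toy$BorrowerRate +
  0.003 * (toy$CreditScoreRangeMid - 700) + rnorm(n, sd = 0.5)

# Nested models: m2 adds one predictor to m1
m1 <- lm(ProsperRating ~ BorrowerRate, data = toy)
m2 <- update(m1, . ~ . + CreditScoreRangeMid)

# R-squared and adjusted R-squared side by side
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
print(round(rbind(r2, adj_r2), 3))

# F-test: does the added predictor reduce residual variance significantly?
print(anova(m1, m2))

# The direction of the relationship is read from the coefficient sign
print(coef(m1)["BorrowerRate"])
```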

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I found that borrowers with good Prosper Scores tend to complete their payments more often than those with bad Prosper Scores, so I decided to focus on finding which variables affect the Prosper Score. Monthly income, credit score, original loan amount and monthly loan payment are positively correlated with Prosper Score, and BorrowerRate is negatively correlated with it. In other words, the ability to complete loan payments appears to depend on these variables.

Were there any interesting or surprising interactions between features?

The most interesting thing I found was that there are many people who pay their loans with 0 income, and that most of them actually completed their loan payments. Another interesting interaction was between the credit score and the ratio of monthly loan payment to monthly income: people in the mid range of credit scores have the highest ratio.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a model starting from ProsperRating..numeric. and BorrowerRate. ProsperRating..numeric. is used instead of ProsperScore because it is a numeric variable, while ProsperScore is a factor variable. Then I added 4 more variables: LoanOriginalAmount, log(MonthlyLoanPayment), CreditScoreRangeMid and EmploymentStatus. R-squared was 0.909 at first and rose to 0.917 as the variables were added.


Final Plots and Summary

Plot One

Description One

This is a histogram of the ProsperScore variable. Its shape resembles a normal distribution. Most people have a ProsperScore between 4 and 8, and a small number of people fall outside this range.

Plot Two

Description Two

This graph shows that ProsperScore and BorrowerRate are negatively correlated. As the box plot shows, the median BorrowerRate decreases as the ProsperScore increases.

Plot Three

Description Three

This graph suggests that a linear model can be built from ProsperScore, CreditScore and monthly income. It shows that people with a good ProsperScore and CreditScore have a large monthly income, and people with a bad ProsperScore and CreditScore have a low monthly income. Note that the smooth line keeps rising as the credit score increases.


Reflection

The data set contains 113937 records with 81 variables. I first filtered out the variables with too many NA values. Then I analyzed each variable by creating a histogram, trying to find the variables that matter most in this data set. Some variables caught my attention: BorrowerRate, BorrowerAPR, CreditScore, ProsperScore, etc. I put the most emphasis on ProsperScore because, according to the description of the data set, ProsperScore indicates the risk of repaying loans, where 10 means the lowest risk and 0 means the highest risk. With 81 variables in the data set, there are many other interesting variables I wanted to investigate further, such as the listing category, which indicates the type of loan, and BorrowerState, which indicates the state of the borrower's address. However, I decided to focus on finding the variables that affect the ProsperScore.

While the univariate analysis showed me which variables would be interesting to use, the bivariate analysis showed the relationships between pairs of the variables I chose. One interesting thing I found was that people with a good ProsperScore tend to have a low BorrowerRate and a high credit score. People whose loan payments take up a higher share of their monthly income usually have mid-range credit scores and Prosper scores; those with high credit scores and Prosper scores do not have a high ratio. Later investigation showed that although people with high and low ratios make loan payments of similar size, those with a high Prosper score usually have a higher monthly income. So people with a high income and a high Prosper score do not necessarily make larger loan payments.

I ran into a few problems during the analysis. The MonthlyIncome variable has a lot of outliers far beyond the mean and median of the data. At first I filtered out those outliers, but I realized that I would have to discard too much of the data. So I used log10 instead, which gave a better histogram and plot of the data.
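The trade-off between filtering outliers and applying log10 can be illustrated with a small sketch. The incomes below are synthetic (log-normal plus a few zero-income rows), the 1.5 * IQR fence is just one common outlier rule and not necessarily the filter used in this analysis, and setting the zero-income rows aside is an assumption made here because log10(0) is not finite.

```r
set.seed(7)

# Synthetic right-skewed monthly incomes plus a few zero-income rows;
# values are made up and only mimic the shape described in the text.
income <- c(rlnorm(1000, meanlog = log(4500), sdlog = 0.7), rep(0, 20))

# Zero incomes cannot be placed on a log scale, so set them aside (assumption)
positive <- income[income > 0]

# Option 1: drop upper outliers with the 1.5 * IQR rule
q <- quantile(positive, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
dropped <- sum(positive > upper_fence)

# Option 2: keep every positive row and transform instead
log_income <- log10(positive)

# Simple moment-based skewness to compare the two distributions
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

cat("rows dropped by the IQR filter:", dropped, "\n")
cat("skewness of raw income:", round(skew(positive), 2), "\n")
cat("skewness of log10 income:", round(skew(log_income), 2), "\n")

# Histogram counts on the log scale, computed without drawing a plot
h <- hist(log_income, breaks = 30, plot = FALSE)
```

The transform keeps all positive observations while pulling the long right tail in, which is why the histogram becomes easier to read without discarding data.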

As I mentioned, there are still many variables that could yield interesting results in further analysis. It would be interesting to find which state has the greatest loan amount borrowed and the highest percentage of completed loans. I would also like to find which type of loan has the highest completion percentage and the highest mean and median Prosper score and credit score.